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Abstract 

We present a novel Monte Carlo simulation of protein folding, in which all heavy atoms are repre- 
sented as interacting hard spheres. This model includes all degrees of freedom relevant to folding 
- all sidechain and backbone torsions - and uses a Go potential. In this study, we focus on the 46 
residue a/ (3 protein crambin and two of its structural components, the helix and helix hairpin. For a 
wide range of temperatures, we have recorded multiple folding events of these three structures from 
random coils to native conformations that differ by less than 1 A dRMS from their crystal structure 
coordinates. The thermodynamics and kinetic mechanism of the helix-coil transition obtained from 
our simulation shows excellent agreement with currently available experimental and molecular dy- 
namics data. Based on insights obtained from folding its smaller structural components, a possible 
folding mechanism for crambin is proposed. We observe that the folding occurs via a cooperative, 
first order-like process, and that many folding pathways to the native state exist. One particular 
sequence of events constitutes a "fast-folding" pathway where kinetic traps are avoided. At very low 
temperatures, a kinetic trap arising from the incorrect packing of sidechains was observed. These 
results demonstrate that folding to the native state can be observed in a reasonable amount of 
time on desktop computers even when an all-atom representation is used, provided the energetics 
sufficiently stabilize the native state. 
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Introduction 



Previous lattice and off-lattice folding simulations featured coarse-grained protein representa- 
tions where amino acids are often modeled as spheres ( [Sali et al., 1994| ; |Shakhnovich, 1997| ; [Pande 
et a f., 1998| ; pill et al., 1995| ; [Bryngelson et al., 1995| ; [Klimov fc 'I'hirumalai, 1996| ; [Honeycutt & 
Thii |umalai, 199!^ ; [Berriz et al., 199^ ; [Zhou fc Karplus, 1999| ; [Clementi et al., 200(J| ). While such 
simplified representations have their advantages, such as the reduction in the number of degrees of 
freedom, there are several shortcomings. First, one could argue that such models do not capture 
the full complexity of conformational space, thereby not truly addressing the Levinthal problem. 
Second, coarse-grained models may lack some realistic features of secondary structure elements. 
Finally, such models do not address the packing of sidechains in the protein interior, which is often 
viewed as an important aspect of the folding process, given the diversity of sidechain shapes and 
their dense packing in the native state ( [Ponder fc Richards, 19871 ). 

Molecular dynamics approaches have attempted to bridge the gap between simulation and re- 
ality by using atomic representations of proteins with explicit solvent molecules. Unfortunately, 
the computational times for such minimally coarse-grained simulations still remains prohibitively 
high as evidenced by a recent work that yielded only a partial folding trajectory ( Puan fc Kollman 



199^) even with the aid of a massively parallel supercomputer. An ensemble of complete folding 
trajectories is required in order to gain meaningful physical insights, especially as our theoretical un- 
derstanding of protein folding has become increasingly grounded on statistical mechanical principles 
([Shakhnovich, 19"97| ; [Unuchic et al., 1997| ; [Pande et al., 2000| ) . Other molecular dynamics research 
efforts have tried to address this issue by obtaining multiple unfolding runs ( [Li fc Daggett, 1996| ; 
Lazaridis fc Karplus, 1997[ ) at extremely high temperatures. Unfortunately, it is unclear whether 



the observed unfolding pathways are indicative of folding pathways normal conditions ( [Finkelstein 
1991). 



We present a Monte Carlo (MC) simulation ( Binder fc Heerman, 199^) which combines an all- 



atom description of the protein with coarse-grained motions and energetics. In this simulation, 
(1) all heavy atoms in the protein are represented as impenetrable spheres, (2) all backbone and 
sidechain torsions, which account for all degrees of freedom relevant to folding, are allowed to move, 
and (3) a square well. Go potential is used for the interaction energy. There are several advantages 
to this method. First, a statistically significant number of complete folding trajectories can be 
collected using conventional computational resources. Second, the atomic-level resolution of the 
simulation yields detailed descriptions of the folding process of actual protein structures including 
sidechain packing. Finally, this coarse-grained approach allows for a systematic investigation of 
the physical principles dictating protein folding. If one begins with a detailed energy function as in 
molecular dynamics, it may be difficult to deconvolute exactly which energy terms were essential (or 
inessential) for folding. On the other hand, as was demonstrated in theoretical investigations ( Gros- 



ber ^ fc Khokhlov, 1994| ), by thoroughly investigating the successes and limitations of coarse-grained 



models, it is possible to test which features of a model are necessary and/or sufficient for describing 
the complex physics of heteropolymers. 

Using our simulation, we have repeatedly folded the 46-residue a — P protein crambin (Figure |I]) 
to within 1 A backbone dRMS (computed over all backbone atoms). Furthermore, the thermody- 
namic and kinetic folding properties obtained from our simulation are consistent with experimental 
studies of other single domain proteins ( |Grantcharova fc Baker, 1997| ; [Jackson fc Fersht, 1991| ), for 



which the folding transition generally exhibits two-state behavior with no accumulating intermedi- 
ates between the denatured and native states ( [Jackson, 1998| ). 



One of the more successful methods reported in the hterature generated folded structures for 
crambin to ~ 3 A C-a dRMS ( |Kolinski fc Skolnick, 199^) using a sequence-based potential. These 



final structures were obtained from a two-step heuristic approach: simplified structures obtained 
from folding runs on a coarser lattice were then refined on a finer lattice. In contrast, our method 
uses a potential based on knowledge of the native structure, but generates the entire folding transi- 
tion without altering key conditions (such as the move set, temperature, or protein representation) 
once the simulation is initiated. As such, it produces an ensemble of trajectories which can yield 
thermodynamic and kinetic information. Although the potential we employ cannot be used to fold 
arbitrary protein sequences, we demonstrate that the full conformational search problem can be 
solved using standard computer resources. 

Theory and Motivation 



Analytical studies ( [Bryngelson fc Wolynes, 1987 ; Ramanathan fc Shakhnovich, 199'^ ; Shakhnovich 



1991; P^a-iide et al., 200"(]| ) and lattice simulations ( |Shakhnovich fc Gutin, 1993| ; [Shakhnovich, 19^ 



determined that only certain sequences can fold in a biologically relevant time scale. Such fast- 
folding sequences featured the native conformation as the pronounced energy minimum. For exam- 
ple, when evolution-like pressures were applied to lattice protein sequences to preferentially select 
mutations that improved folding speed (analogous to real-life evolutionary pressures to select pep- 
tide sequences that survive proteolysis), the resulting sequences featured a large energy gap between 
their native and competing non-native states ( |Gutin et al., 1995]) ). Conversely, in order to success- 



fully simulate the folding of biologically occurring protein sequences, it is necessary to use a model 
that places the native state at the bottom of the energy spectrum. 



As noted in the Introduction, our folding model combines a near-perfect geometrical represen- 
tation of the peptide chain with a potential energy function that delivers the required "energy gap" 
by explicitly assigning the native conformation as the ground state (see Methods). In general, a 
Go potential is often viewed as providing an idealized energy landscape by strongly biasing folding: 
first, there is a strong correlation between energy and structural distance (e.g. C-a RMS) from the 
native state; and second, the potential energy surface is seen as very smooth and downhill when 
going from the unfolded to the native state. 

Despite this criticism, important aspects about folding kinetics can be learned as demonstrated 
by several recent studies (|Clementi et ai, 2000| ; Pande fc Rokhsar, 1999| ; |Zhou fc Karplus, 1999|) . 



One particularly interesting aspect is the role topology may play during the folding process ([Baker 
200(|). Since folding pathways are determined by the barriers and local minima encountered on a 
free energy landscape, non-trivial folding kinetics may be observed in a model that allows for confor- 
mational entropy to compete with the Go energy at the relevant temperatures. The predominance 
for frustration due to topology can be problematic for long enough chains, but certainly becomes 
pronounced when regular geometrical elements such as a-helices and /3-sheets must be formed by 
bringing together distant parts of the chain. In fact, we demonstrate below that if realistic protein 
geometry is used, severe kinetic traps due to topological frustration make the folding kinetics ex- 
tremely complex. 



The move set 



Our main concern when choosing the move set was to balance realism and efficiency. The natural 
motions of a solvated polymer chain at room temperature occur via torsional degrees of freedom. 
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However, it appeared that using only single torsional rotations may be unrealistic, since it would 
lead to large conformational changes of the polymer chain. In solvent, large moves of such type 
are rare because of the drag experienced by the moving chain. A worse situation is encountered 
in the compact state, as large rotations are prohibited because of excluded volume restrictions. To 
properly address these issues, the torsional move set we chose has two important features. First, 
all of our moves consist of either two, four or six concerted rotations. This allows for a variety of 
"loop" -like moves, which are particularly important when in the compact state. These rotations 
were required to be within 6 residues of each other, in order to ensure that the majority of moves 
resulted in local conformational changes. Second, the step sizes for these concerted rotations were 
drawn from a Gaussian distribution of relatively small width. More specificially, the probability for 
selecting a backbone step size of 6 was given by 



where a = 2°. This ensures that a random walk is performed in torsional space, which should 
likewise result in a random walk in conformational space. The vast majority of step sizes are small 
enough that the conformational changes induced by our moves are local. Very rarely, when a for- 
tutitous combination of torsions and rotations are chosen and the chain is in an extended state, a 
global conformational change is allowed. During all of our simulations, such global conformational 
changes were not observed in the compact state: large conformational changes were always accom- 
panied by partial unfolding events. 

Can our Monte Carlo simulation yield information about folding kinetics? 
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MC simulations have been used in a wide range of problems in statistical physics to obtain 
equilibrium information ( [Binder fc Heerman, 1992" ). As long as a MC simulation's move set satisfies 



the criterion of detailed balance, convergence to a canonical distribution is guaranteed. A much 
more controversial aspect of MC simulations has been its use in determining the relaxation kinetics 
to an equilibrium distribution. Because there is no explicit correspondence with real time in a MC 
simulation, one might raise the objection that all kinetic interpretations of MC trajectories are 
dependent on the particular move set chosen. 

For simple MC lattice folding simulations utilizing the standard Verdier move set (consisting of 
the crankshaft, corner and tail flips) ( [Verdier, 1973|) , it was demonstrated that general conclusions 
on folding kinetics, such as the importance of specific nucleation for fast-folding, could be drawn 
if the analysis was performed on an ensemble of folding runs ( jAbkevich et al., 1994a[ ; jAbkevicE 



et a f., 1994"5| ). This approach was supported by an earlier study, which analytically showed that 



the coil-globule relaxation time, as computed by the use of the Verdier and other local move sets, 
matched theoretical predictions( [Hilhorst fc Deutch, 1975| ). Unfortunately, such rigorous analyses 
are extremely difficult to complete as the complexity of the system being simulated increases. 

To justify our use of MC to interpret folding kinetics, we provide the following heuristic ar- 
gument. Let W{s) = {wi{s)} represent the probability distribution of states at step s of a MC 
simulation. We can then represent the progression in our MC simulation as the Markov process 

W{s + 1) = P-W{s) 

where P is the transition matrix, whose i —j th element is given by the probability to go from state i 
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to j. In a Metropolis MC simulation, the individual elements of this transition matrix obey detailed 
balance, which means it is the equilibrium solution to the series of first-order kinetic equations (also 
known as the master equations) 

dwi sr^ r-, n 

— = 2^ -PijWi + PjiWj = 

j 

. Now consider a move set that is sufficiently local and unbiased towards the native state. During 
the relaxation to equilibrium, the ensemble population will be highest where there are free energy 
minima. The evolution of the MC simulation thus leads to a gradual shifting of the ensemble pop- 
ulation from local minima to the global minimum, the native state. If kinetic events are separated 
by a large enough number of local moves, and they are observed in an ensemble of relaxation tra- 
jectories, they represent significant state population shifts and reflect properties of the free energy 
landscape (e.g., the depth of local minima and barrier heights). In this manner, examining an 
ensemble of trajectories highlights the major kinetic events during folding by averaging out short 
MC time events. Importantly, this has the effect of minimizing differences arising from the selection 
of a particular move set. 

The kinetic picture arising from a MC simulation is thus coarse-grained. If a common free 
energy landscape is used, both MC and other types of simulations should agree on the sequence of 
major folding events, if they are sufficiently separated in time. This was demonstrated by Rey and 
Skolnick (1991), who concluded that a MC simulation yielded the same folding pathways for an 
a-helical hairpin as a Brownian dynamics simulation, for which there was real time evolution. We 
emphasize that MC simulations are not immediately justified in making quantitative predictions 
or microscopic analyses of folding kinetics. In order to reliably predict folding rates, every MC 
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simulation needs to be extensively calibrated to existing experimental data. It is likely that such 
calibration will rule out several move sets. 

Finally, perhaps the most important requirement of achieving quantitative agreement with exper- 
imental kinetic data is that the free energy landscape must be sufficiently realistic. Even molecular 
and Langevin dynamics simulations, where a well-defined time step is used to evolve the simulation, 
will be unable to make quantitative - and perhaps even qualitative - analyses of folding kinetics 
if this requirement is not met. Because there have been no reports of a successful ab initio fold- 
ing potential, we decided that the Go potential, combined with an all-atom representation of the 
protein, was the best choice for this simulation. 

Before proceeding to analyze the folding kinetics of crambin, we decided to more accurately 
assess the validity of our move set by examining the thermodynamics and kinetics of the helix-coil 
transition, for which there are experimental and molecular dynamics data. 



Results 



The formation of the {i,i + 4) hydrogen bonding scheme in helical structures results in a high 
native contact density. As a result, a nondiscriminate use of Go potentials will lead to unusually 
strong energetic biases towards helices. We thus started by calibrating the stability of the two cram- 
bin hehces with the predictions of AGADIR ([Munoz fc Serrano, 1991 ) (at pH = 7 and T = 298 K): 



at temperature relevant for our crambin folding simulations, both helices had zero helical propensi- 
ties. Elimination of backbone contacts was the simplest way to dramatically reduce helix stability 
without introducing biases due to sequence identity. All subsequent data was collected while using 
this slightly modified Go potential. 
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Helix-coil transition 



Both helix 1 and 2 (see Figure |I| for sequences) exhibited an abrupt transition to the coil state, 
with the transition temperatures occurring at T^^lix 1 ~ i 2 and T^^lix 2 ^ g respectively 
(Figures |^ & b). The energy histograms show a sudden shift in helix and coil state populations as 
rphelix 1 passed, with a broadening of the distribution at T^^li^ 1 (Figure H). The free energy 
curves clearly demonstrate the emergence of a local, second minimum centered at 3.5 A backbone 
dRMS for 1.1 < T < T^^lix 1 (^pjg^j-g |]q) At xj^^lix 1^ ^^^is second minimum becomes equal in 
free energy to the helix state minimum, resulting in an equilibrium of both helix and coiled states. 
Finally, the heat capacities for both helices are sharply peaked around Tf (Figure ^). Based on the 
width of the heat capacity peak at Tf, it appears that the helix 2 shows a weaker transition, which 
is expected for a shorter helix. These thermodynamic observations fully agree with Zimm-Bragg 
theory ( Zimm fc Bragg, 1959 ), simulations ( [Hummer et al., 2000 ; Daura et al., 199U ; Daggett 



Levi |tt, 1992| ), and experiment ( [I'hompson et al., 1997] ). 



The median first passage time (FPT) for helix 1 rose rapidly as T was increased for T > 1.1 
(Figure |a). Given that the Go energy more or less increases with increasing dRMS, it is clear that 
the free energy barrier to helix formation (Figure ^c) is purely entropic. Interestingly, for T < 1.1, 
the helix 1 formation rate becomes weakly dependent on temperature, suggesting that there are no 
major free energy barriers. At all temperatures, the distribution of FPTs exhibited long tails, but 
more data was needed to accurately determine whether the distribution was non-exponential. As 
noted by ( |Hummer et al., 2000| ; [Hummer et al., 200 1| ), the formation rate appears to be dominated 



by a diffusive search for a nucleation event. Once the nucleus is formed, the formation of the rest 

11 



of the helix is rapid (see the trajectory at T ^ T/ in Figure |^). 

Examination of the ensemble helix 1 formation kinetics shows that ALA 9 is the dominant 
nucleation site at both low and high temperatures (Figures ^-e). This is shown by the rise in the 
fraction of native contacts (Q) made by the amide nitrogen of ALA 9. It is necessarily accompanied 
by rise in the carbonyl oxygen Q of SER 6, as ALA 9 and SER 6 make key sidechain-sidechain 
contacts to initiate the formation of the first turn. A logical explanation for this dominant nucleation 
event is the absence of sidechain entropy at ALA 9 in our model, thus lowering the free energy cost 
of constraining this residue in a helical conformation. Upon nucleation, at higher temperatures, the 
propagation appears to rapidly proceed towards the N-terminus (Figures ^ & e), while at lower 
temperatures, this asymmetry seems to be weaker (Figures & c). This may be explained by the 
larger entropic cost of constraining the large sidechain of ARG 10 at higher temperatures. A similar 
observation was made by Pande et al. (personal communication, 2000), whose simulation showed 
that helix propagation was slowed by the presence of arginine residues in a poly-alanine helix. The 
two exponential relaxations processes detected in the A8-R-A4 peptide by Thompson et al. (1997) 
may be explained by a similar phenomenon: the fast quenching of the N-terminus fluorescent label 
is due to the rapid nucleation/propagation of the alanine rich regions (from both N- and C-termini) 
and the slow phase due to the slow incorporation of the arginine residue into the growing helix. 

At lower temperatures, SER 11 appears as a competing nucleation site. Like ALA 9, SER 11 
is a low entropy residue. However, in order for SER 11 to make helical contacts with neighboring 
residues, it requires constraining at least one of three large sidechains (ARG 10, ASN 12, PHE 13), 
which may be unfavorable for entropic or steric reasons. In contrast, forming the SER 6 - ALA 9 
interaction requires constraining smaller residues. For this reason, it seems to be more favorable to 
nucleate the helix at ALA 9. In support of this idea, we note that SER 11 disappears as a nucleation 
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site at higher temperatures. 

A third nucleation site occurs at LEU 18. This appears to be driven primarily by energy, as 
many native sidechain-sidechain contacts are made with VAL 15. However, as the effect of sidechain 
entropy is increased as temperature is raised, we see that the VAL 15 - LEU 18 interaction forms 
slower relative to the SER 6 - ALA 9 interaction than at lower temperatures. Interestingly, raising 
the temperature seems to have a larger effect in reducing the capacity of SER 11 to nucleate the 
helix compared to LEU 18. 

Although the propagation asymmetry predicted by Young et al. (1996) was not observed at 
all temperatures, the philosophy behind their general approach seems to be correct: preferences in 
helix propagation and nucleation are determined by the specific atom-atom interactions. 



Folding simulations of crambin 



Given that our move set yielded reasonable kinetic and thermodynamic helix-coil transition data, 
we proceeded to fold crambin. Starting from random coils, we considered the protein folded if the 
following three criteria were satisfied for at least 5 x 10^ steps: (1) the fraction of native contacts {Q) 
exceeded 0.7; (2) the backbone dRMS was less than 1.25 A; (3) the fraction of native contacts in the 
four secondary /tertiary structural features (Qi, 1 < « < 4) exceeded 0.5 (see Figure |I|). Criterion 
(1), which is similar to the one used to signify a folding event in lattice simulations ( Abkevich et al. 



199^^, could not by itself distinguish conformations that one might intuitively consider folded (i.e., 
properly formed secondary structure and low overall dRMS) from obvious misfolds. Although our 
definition for folding did not measure when equilibrium was attained, it was useful for identifying 
when the major folding event - the transition from the random coil to a near-native state - occurred. 
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Thermodynamics 



The folding thermodynamics is aptly described by a first order-like transition (|Hao fc Scheraga" 



199^) (Figure with the MC folding transition temperature (Tj) estimated to be between 2.1 
and 2.3. For high temperatures (2.0 < T < 3.0), the data is well-fit by the exponential function 
typically used to describe two-state folding behavior ([Fersht, 1999|) . This is in general agreement 
with the folding thermodynamics obtained with lattice models QShakhnovich, 1997 ). By matching 
the backbone dRMS value from NMR measurements (1.48 A) ( pCu et al, 1999|) , we estimate that 
room temperature corresponds to 2.1 < T < 2.2, which is consistent with the general observation 
that most proteins are marginally stable at room temperature ( preighton, 19921) . 



Kinetics 



Representative folding runs are shown in Figures |^a-e. Although different folding pathways 
were observed (Figure |^), a fast-folding pathway was found where kinetic traps were avoided. The 
ensemble of trajectories following this fast-folding pathway was characterized by a definite sequence 
of events: (1) formation of the inter-helix contacts (event 2 in Figure (2) formation of the 
two helices (event 3); and (3) formation of the /3-sheet (events 4 and 5). This observation can be 
rationalized from simple topological considerations: the helices are most easily formed when the 
two ends of the polymer are not constrained by the /5-sheet. 

At high temperatures (1.7 < T < 1.8), at least half of the folding trajectories did not collapse 
within 10^ steps, making clear that collapse is the rate- limiting step (Figures ^). The kinetics 
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appear to be two state, as evidenced by the narrowly distributed near-native and unfolded popu- 
lations, rapid collapse events, and no accumulating intermediates. The transiently populated state 
with energy -150 corresponds to event 2 of the fast-folding pathway (Figure 0) where a subset of the 
inter-helix and helix contacts have been formed. At middle temperatures (1.55 < T < 1.675), this 
intermediate accumulates for longer times and the collapsed state ensemble broadens (Figure |^). 
At low temperatures (T < 1.525) this effect becomes particularly pronounced, with high energy, 
collapsed states accumulating for very long times (Figure D^). While collapse is very fast at these 
temperatures, the persistently broad distribution of low dRMS (< 2 A) states indicates that the 
compact state is riddled with deep traps. Finally, the median folding time rapidly increased for 
T < 1.4 and T > 1.8, and a broad minimum existed at 1.5 < T < 1.65. More runs are needed to 
narrow the fastest folding temperature to a smaller range. 



Role of helix stability in crambin folding 



The zero helix stability at crambin folding temperatures likely explains why the diffusion-collision 
scenario ( [Karplus fc Weaver, 1976|) was not observed as the major fast-folding pathway; in fact. 



only 1 out of over 150 runs which folded had the helices forming prior to the inter-helix contacts. It 
is emphasized that the present move set is not likely to contain any biases against diffusion-collision 
kinetics. Unlike lattice MC simulations, in which helices can reorient only by partial unfolding, 
our model allows entire secondary structure elements to move as units. To confirm this point, we 
performed folding simulations of the two helix hairpin motif (residues 6-30) in isolation. 

In general, if the temperature is sufficiently lowered, we readily observed diffusion-collision 
kinetics. A representative diffusion-collision folding trajectory of the hairpin at T=0.4 is shown in 
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Figure From the ensemble data, it is clear that the diffusion-collision scenario emerges as the 
dominant kinetic pathway if helix formation is rapid and the helical state is stable (Figures ^-c). 
This requires that the temperature be sufficiently low (T < 0.8). Since marginally stable /5— hairpin 
formation is inherently slow ( [^'inkelstein, 1991D , at low temperatures where helices fold fast, there is 



a clear separation of time scales: the helices form first, followed by the hairpin. Given that folding 
occurs at temperatures where helices are unstable, we must thus rule out diffusion-collision as an 
important kinetic pathway for this protein 

Using a C-a off-lattice model with a Go potential, Zhou et al. observed that for a three-helix 
bundle the "fast track" folding pathway, where no kinetic intermediates are encountered, was the 
diffusion-collision pathway. In light of our hairpin data, this result is not surprising. The three-helix 
bundle was being folded at extremely low temperatures: for a bias gap of 1.3 (native interaction 
= -1; non-native interactions = +0.3) , the protein was being folded at a temperature that was ~ 
25% of the collapse transition temperature (Tj). On our scale, we began to observe crossover to 
diffusion-collision behavior at 0.8/2.2 k, 0.36T/-. Although the parameters of the Go potential are 
slightly different, it is likely that Zhou et al. were working at extremely low temperatures where 
both the helices and the protein were very stable. If we estimate Tf for actual proteins to be 
roughly 350 K, this implies that the folding simulations by Zhou et al. were completed at less than 
100 K. It is also known experimentally that only the third helix of the bundle is marginally stable 
(~ 30% helical content) under normal folding conditions ( Bai et al., 1997 ). Folding a similar three- 



helix bundle using Langevin dynamics, Berriz and Shakhnovich found that the diffusion-collision 
behavior was observed at low temperatures but disappeared as the temperature was raised (GB and 
ES, unpublished data). For these reasons, it will be interesting to see if diffusion- collision behavior 
continues to be observed by Zhou et al. at higher temperatures. 
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Characterization of traps 

The kinetic traps observed were either backbone misfolds, where the secondary structure ele- 
ments were not properly formed, or compact conformations with incorrectly packed sidechains. The 
first type of backbone misfold resulted when the /5-sheet folded incorrectly after the helix-turn-helix 
moiety was formed. Correction of this misfold required partial or complete unfolding of the /3-sheet. 
At high temperatures (1.7 < T), this misfold was metastable, as the /5-sheet repeatedly folded and 
unfolded until the correct topology and packing was achieved (Figure ^). In general, forming the 
sheet was observed to be the rate-limiting step at low temperatures (T < 1.5) (see Figure ^c). The 
other type of backbone misfold occurred when the two helices were not formed properly prior to the 
formation of the (3-sheet. Two pathways (A and B in Figure 0) were available at all temperature 
to correct these helix misfolds. Pathway A required the /3-sheet to partially or completely unfold. 
In contrast, pathway B corrected the helix while keeping the sheet intact. This pathway was ac- 
celerated with increasing temperature, as the "breathing motion" resulting from greater backbone 
fluctuations facilitated reinsertion of the helix sidechains (compare, for example, the Q4 fluctuations 
in Figures |c and d). 

The traps resulting from incorrectly packed sidechains (Figure |^) are characterized by low 
backbone dRMS and correct backbone topology (high Qi's) and were observed only at low tem- 
peratures. At higher temperatures, after the major folding event, equilibrium sidechain packing is 
rapidly achieved. As shown in Figure for T <T* 1.6, near-native backbones (dRMS < 1.25 
and Qi's > 0.6) obtained from folding runs would not relax further to match the energies attained 
in runs initiated from the native state. This suggests that T* may signify a kinetic transition tem- 
perature, where ergodicity is broken and a gap emerges between measurements taken from finite 
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but long unfolding and folding runs. At low temperatures, it appears that the major backbone fold- 
ing event traps sidechains in disordered non-native conformations, which cannot be readily relaxed 
because of insufficient backbone fluctuations. 



Discussion 



Compared to molecular dynamics studies of solvated folding (|Duan fc Kollman, 1998|) , the ap 



proach used in this study is still minimalist. Yet, we have demonstrated that important insights 
into the folding process, such as the role of sidechain packing, may be obtained by properly combin- 
ing an all-atom description with simple, atomic resolution energetics. Importantly, our simulation 
can record a statistically significant number of folding events at atomic resolution for real protein 
sequences, thereby allowing a relatively direct comparision between simulation and experimental 
results. 

Unfortunately, because of the lack of experimental folding studies on crambin, our results 
presently cannot be directly verified. They should be viewed as theoretical predictions which may 
be tested by future experiments. We selected crambin for this study because of its small size and 
its non-trivial a — [3 structure. We note that as the temperature approaches Tj, the folding kinetics 
of crambin approaches two state behavior, in agreement with experimental studies of most small 
single domain proteins ([Jackson, 1998| ). It is plausible that under a Go potential, where all native 



interactions are treated identically, the folding properties of crambin should be consistent with those 
of similar size and topology. 

Furthermore, the modern viewpoint that nucleation is a major event in the folding kinetics 
( [bersht, 1995| ; [Sosnick et at., 19^ ; [Abkevich et al., 1994 bj ) is consistent with our results: at temper- 
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atures just below Tf, collapse via formation of the /5-slieet is the major kinetic barrier (Figure ^). 
A transiently populated, partially collapsed intermediate in which the helix hairpin is formed is 
seen (Figure and is probably necessary for proper sheet formation. Once the sheet has formed, 
the chain is compact, and folding proceeds rapidly with concomitant sidechain packing. With the 
individual helices unstable in isolation at folding temperatures, the hairpin formation did not follow 
a diffusion-collision pathway. Rather, since the necessary event for hairpin formation is the local- 
ization of inter- helix contacts, the kinetic mechanism is also likely to be a nucleation event, similar 
to the one theorized for /3-hairpins ( [Finkelstein, 19911) . 

In addition, structural elements that bring together residues that are closer in position along 
the chain (such as the helix hairpin) appear to form faster than those which bring together residues 
distant in sequence position. We believe this is similar to the observation made by Plaxco et al. 
that folding kinetic rates are correlated with the relative contact order of the native state structure 
( IPlaxco et aL, 1998| ). 

Even with a potential strongly biased towards proper folding, the presence of diverse sidechain 
geometries and excluded volume interactions can lead to the presence of severe kinetic traps, as we 
observe at very low temperatures. However, our results demonstrate that at reasonable temper- 
atures sidechains can be successfully packed in a manner consistent with a low dRMS native-like 
backbone conformation. We believe that the move set we employed contributed to the success of 
our simulation. Efficient sampling in the compact state, as evidenced by the > 10% acceptance rate 
even for very low temperatures, resulted from both global and local conformational changes being 
permitted, with global changes becoming more available with higher temperatures. The qualita- 
tive folding behaviour is consistent with experimental observations of small single domain proteins, 
suggesting that the essential features of a polypeptide with all torsional degrees of freedom are 
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captured by this move set. 

It is important to note that the hehx-coil transition kinetics obtained from our simple model 
qualitatively showed agreement with experiment and molecular dynamics simulations. Furthermore, 
whereas current state-of-the-art molecular dynamics simulations focus on extremely short helical 
segments (penta- and heptapeptides) to study the the helix-coil transition ( |Hummer et al., 2001 



Daura et al., we have been able to carry out statistical mechanical analyses on full length 

helices. The kinetics of early events during folding, such as helix formation, which occur on the 
timescale of hundreds of nanoseconds, can thus be investigated with our model provided the folding 
mechanism is examined from an ensemble viewpoint. Since the MFPT for helix formation is roughly 
1 million MC steps, this puts the folding of crambin (50-100 million MC steps) in the microsecond 
range, which is reasonable for a protein of its size. Recent data suggests that the protein G 
/3— hairpin takes 50 million MC steps to fold (JS, ELK, EIS, unpublished data), which places it 
in the microsecond range to fold in real time, nearly matching the ~ 10 microsecond rate observed 
experimentally (Munoz et al., 1997). Furthermore, early data on protein G folding (JS, ELK, 



EIS, unpublished data) indicates a folding rate of a few billion MC steps. This is one order of 
magnitude faster than the experimental rate (a few milliseconds, ([McCallister et al, 2000|) ) but 



the qualitative agreement is promising. It particularly encouraging to note that the folding rates 
of secondary structure elements as observed in our simulation are properly separated in timescales. 
This justifies the use of our simulation to draw qualitative conclusions about kinetic events involving 
the formation and/or organization of secondary structure elements. 

We intend to carry out similar studies on larger single domain proteins for which there are 
extensive experimental data. The faster folding times we have preliminarily observed for protein 
G is undoubtedly because of the Go landscape, which presents an idealistic, perfectly downhill 
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energy landscape. With the development of a sequence-based potential for our model (currently in 
progress), we expect our correspondence with experimental rates will improve. Taking our results 
as a proof-of-concept, we believe that detailed investigations of folding pathways may finally be 
possible using ordinary computational resources. 



Methods 



Full atom representation 

Each non-hydrogen atom present in the crambin crystal structure ([Teeter et ai, 199^ ) (Brookhaven 



PDB accession code: lABl) was represented by a hard sphere, whose size was given by scaling the 
relevant VdW radius (r) from ref. ( P?sai et ai, 19991) by a factor a{< 1). Helix 1 and helix 2 na- 
tive structures were obtained by extracting residues 6-18 and 23-30, respectively, from the crambin 
crystal structure. 



Move set 

A single MC step consisted of a backbone move followed by 10 sidechain moves. Each backbone 



and sidechain move was accepted according to the Metropolis criterion ([Metropolis et al, 1953|) . 
A backbone move consisted of rotating the (f) — ip angles of up to 3 non-proline residues from a 
randomly selected window of 6 consecutive residues. A sidechain move consisted of rotating all 
sidechain torsion angles (x) of a randomly selected non-proline residue. The size of the backbone 
and sidechain rotations were obtained from a Gaussian distribution with zero mean and standard 
deviation 2 and 10 degrees, respectively. 
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Square well Go potential 



We used an atomic square well potential ( |McQuarrie, 19761) with the well depths given by Go 



energetics ( |Go fc Abe, 1981" ). In particular, for two atoms A and B separated by a distance R, the 



energy e{A, B) was calculated according to 



oo 



R<a 



<AB) 



= < 



A(A,5) a<R<\a 







R> \a 



where a = a{rA + rB) is the hard core distance, A is a scaling factor > 1, and A{A, B) = —1 if A and 
B are in contact in the native conformation and 1 otherwise. The total energy of a conformation 
was computed as the sum over all pairs: 



All atom pairs ofi — i + 1 residues were excluded to eliminate any biases towards local structure, and 
all backbone-backbone contacts were ignored to eliminate non-specific interactions. The energies of 
the disulfide bonds were treated no differently from any other contact. We chose a = 0.75 because 
it was the largest value for which the native structure exhibited no steric clashes. Furthermore, with 
a = 0.75, we could not fold crambin with the sidechain torsions held fixed at their native values, 
suggesting that a was sufficiently large to enforce excluded volume constraints. The selection of 
small A values (< 1.6) significantly increased the time of collapse, while large A values (> 2.0) 
made sidechain packing more degenerate. We therefore selected A = 1.8 in order to balance the two 
effects. This makes the contact distance for methyl carbons to be 5.08 A. 



E 



E <Ai?) 



all pairs 
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Helix-coil transition thermodynamics and kinetics 

Thermodynamic data were collected by sampling uncorrelated states observed along long runs of at 
least 1.5 X 10® MC steps), in order that multiple helix-colil transitions were observed at temperatures 
near Tf. 

Median first passage time data at a given temperature were collected by performing 100 folding 
runs until the fraction of native contacts Q hit 0.7 or 10 x 10^ MC steps had elapsed. For the 
ensemble kinetic data, 100 folding runs of length 5 x 10^ were collected, regardless of whether a 
folding event occurred. 



Free energy calculations 

Using thermodynamic runs, the average energy as a function of backbone dRMS, E{r), was first 
measured. The dRMS of the uncorrelated states was next histogrammed in bins of 0.2 A to compute 
the probability of observing a particular dRMS, p{r). The entropy as a function of backbone dRMS, 
S = \nW{r), was obtained by inverting the statistical mechanical relation 



PVJ = 7^ 



where W{r) is the density of states at a given dRMS and Z is the partition function ([Ferrenberg & 



Swe jidsen, 1988| ). The partition function was explicitly determined from the r = bin by assuming 



that W{r = 0) ~ 1. Finally, the free energy at a temperature T was obtained via the identity 
G{r) = E{r)-TS{r). 
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Crambin folding kinetics and thermodynamics 

Random coils were first generated by unfolding crambin for 3 x 10^ steps with only the excluded 
volume interaction turned on. Both the average energy {E — 77, Q — 0.02) and average structure 
{Rg = 19.5, backbone dRMS = 17.0) of the random coils indicate that these conformations are 
completely unfolded and unstructured. Each random coil was then simulated with the square- well 
Go potential turned on at a particular temperature until it folded or 10® MC steps elapsed. Of the 
250 runs completed for T — 1.25 to 1.875, 165 folded within our observation window of 10® steps. 
For 1.4 < T < 1.8, 135 out of 160 (= 84 %) runs folded. 2.5 x 10^ MC steps approximately took 1 
hour of computation time on a Pentium III 550 Mhz PC. 
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Figure captions 

Figure 1 

The 46-residue protein crambin. The important secondary/tertiary structural elements are indi- 
cated by different colors: black - helix 1 (residues 6-18, sequence: SIVARSNFNVCRL); red - helix 
2 (23-30, EAICATYT); green - inter-helix contacts; blue - (3-sheet (1-5, 32-35, 38-46). As shown 
by the matching colors, the Qi parameters are defined as the fraction of native contacts in specific 
structural elements: Qi - helix 1; Q2 - helix 2; Q3 - inter-helix contacts; (^4 - /3-sheet. 

Figure 2 

Helix-coil transition trajectories (left panels) and energy histograms (right panels) for helix 1 at 
various temperatures: above (T = 1.35; top panels), near (T = 1.175; middle panels), and below 
(T = 0.9; bottom panels) transition temperature. 

Figure 3 

Summary of helix-coil transition thermodynamics, a & b. Transition curves for average E and 
backbone dRMS for helices 1 and 2, respectively. Note the difference in scales along the y-axis. c. 
Free energy of helix 1 formation at various temperatures (see Methods). The backbone dRMS is 
chosen as the order parameter. Near the transition temperature (T = 1.175 - 1.225), two distinct 
free energy minima appear (labeled "helix" and "coil"), d. Heat capacity (C„) curves for helices 
1 and 2. was computed via the relation, C„ = {{E^) — {E)^) /T^, where () refers to ensemble 
averages ( [McQuarrie, 1976| ). 
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Figure 4 

Summary of helix-coil transition kinetics, a. Median first passage time (FPT) for helices 1 and 
2. b-d. Ensemble averaged kinetics of the early events during helix 1 formation at T = 0.7 and 
T = 1.175. This represents the average fraction (Q) of backbone amide (b &c d) and carbonyl (c 
8z e) native contacts observed in 100 runs as a function of MC steps. An increase in the amide Q 
reflects helix formation towards the N-terminus, while an increase in the carbonyl Q reflects helix 
formation towards the C-terminus. Note the difference in scales along the y-axes. 

Figure 5 

The energy and heat capacity (shown as insert) of crambin as a function of temperature (T) . The 
energy data (•) were collected from uncorrelated structures sampled from simulations of 10^ MC 
steps initiated from the native state. For T < 2.0, the energy data was fitted to a linear function 
= 0.987), while for T > 2.0, the exponential function f{E, T) = + (K - ^n) exp(-C/T + 
D)/{1 + exp(-C/r + D)) (C = 70.82, D = 34.16, E„ = native energy = -658, Eu = unfolded 
energy = 16.67; estimated a ~ 90) was used. The heat capacity (Cy) was obtained by evaluating 
Cv — dE/dT on the fitted curves. We note that because our simulation does not exphcitly model 
solvent-protein interactions, the computed heat capacity falls to zero for T > Ty, in contrast to 
experimental observations. 

Figure 6 

Typical folding runs at various temperatures. The upper panel for each run shows backbone (black) 
and sidechain (red) dRMS as the runs progresses. The lower panel tracks the four Qi values, with 
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the color coding shown in Figure |^a. Note that the lengths are different for each run. a. Collapse- 
rate limited cooperative folding at high temperature (T = 1.875). The secondary structure follows 
the sequence of events observed in the fast-folding pathway (events 1-5 in Figure |^ are labeled in the 
Qi plot), b. A trajectory ending at a helix 2 misfold at low temperature (T = 1.25). c. Successful 
folding after encountering a helix 1 misfold at high temperature (T = 1.775). This corresponds to 
pathway B in Figure |^. d. Successful folding after encountering a /?-sheet misfold at high temper- 
ature (T = 1.825). The /?-sheet misfold is metastable at this temperature, e. A trajectory ending 
at a low temperature sidechain-packing trap (T = 1.425). f. Broken ergodicity for T < T* 1.6. 
For each temperature, 52 runs which folded to near-native conformations (dRMS < 1.25 and QiS 
> 0.6) were each extended for an additional 25 x 10^ steps to allow further relaxation. The average 
of these extended runs are indicated by o. The average values obtained from simulations started 
from the native state (see Figure |^) are indicated by •. 

Figure 7 

Summary of the folding kinetics. The successive events observed along the fast-folding pathway are 
marked by boxed numbers. Note that the sidechain packing trap is not indicated on this figure. 

Figure 8 

Energy and backbone dRMS kinetics histograms as a function of MC time of all runs at high 
(1.7 < T < 1.8; panel a), middle (1.5 < T < 1.7; panel b), low (T < 1.5; panel c) temperatures. 
The color indicates the fraction of runs at a particular energy or dRMS at a given MC time. The 
color scale goes from blue (0.0) to red (0.1). Because the runs were terminated as they folded, 
after a run has folded it no longer contributes to the histogram. For this reason, as time progresses 
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and more runs fold, the color of the histograms moves towards blue. At all times, the histogram 
is normalized by the total number of runs at step 0. The data may be viewed as being obtained 
by a hypothetical experiment that records the energy and dRMS of all non-folded structures as the 
simulation progresses. The label "hairpin" refers to the formation of inter-helix contacts. 

Figure 9 

Summary of the hairpin formation (residues 6-30) formation kinetics, a-c. Ensemble kinetics for 
various temperatures. The fraction of inter- and intra-helix native contacts is plotted as a function 
of MC time. The averages were taken over 32 runs of length5.0 x 10^. d &; e. A typical folding 
trajectory at T = 0.4. This trajectory exhibits diffusion-collision behavior. 
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